Abstract 


Unlike perfect-information games, imperfect-information 
games cannot be decomposed into subgames that are 
solved independently. Thus more computationally intensive 
equilibrium-finding techniques are used, and abstraction— 
in which a smaller version of the game is generated and 
solved—is essential. Endgame solving is the process of com- 
puting a (presumably) better strategy for just an endgame 
than what can be computationally afforded for the full game. 
Endgame solving has many benefits, such as being able to 
1) solve the endgame in a finer information abstraction than 
what is computationally feasible for the full game, and 2) in- 
corporate into the endgame actions that an opponent took that 
were not included in the action abstraction used to solve the 
full game. We introduce an endgame solving technique that 
outperforms prior methods both in theory and practice. We 
also show how to adapt it, and past endgame-solving tech- 
niques, to respond to opponent actions that are outside the 
original action abstraction; this significantly outperforms the 
state-of-the-art approach, action translation. Finally, we show 
that endgame solving can be repeated as the game progresses 
down the tree, leading to significantly lower exploitability. 
All of the techniques are evaluated in terms of exploitabil- 
ity; to our knowledge, this is the first time that exploitability 
of endgame-solving techniques has been measured in large 
imperfect-information games. 


Introduction 


Imperfect-information games model strategic settings that 
have hidden information. They have a myriad of applications 
such as negotiation, shopping agents, cybersecurity, physical 
security, and so on. In such games, the typical goal is to find 
a Nash equilibrium, which is a profile of strategies—one for 
each player—such that no player can improve her outcome 
by unilaterally deviating to a different strategy. 

Endgame solving is a standard technique in perfect- 
information games such as chess and checkers (Bellman 
1965). In fact, in checkers it is so powerful that it was used 
to solve the entire game (Schaeffer et al. 2007). 

In imperfect-information games, endgame solving is dras- 
tically more challenging. In perfect-information games it 
is possible to solve just a part of the game in isolation, 
but this is not generally possible in imperfect-information 
games. For example, in chess, determining the optimal response 
to the Queen’s Gambit requires no knowledge of the 
optimal response to the Sicilian Defense. To see that such 
a decomposition is not possible in imperfect-information 
games, consider the game of Coin Toss shown in Figure 1. 
In that game, a coin is flipped and lands either Heads or Tails 
with equal probability, but only Player 1 sees the outcome. 
Player 1 can then choose between actions Left and Right, 
with Left leading to some unknown subtree. If Player 1 
chooses Right, then Player 2 has the opportunity to guess 
how the coin landed. If Player 2 guesses correctly, Player 1 
receives a reward of −1 and Player 2 receives a reward of 1 
(the figure shows rewards for Player 1; Player 2 receives the 
negation of Player 1’s reward). Clearly Player 2’s optimal 
strategy depends on the probabilities that Player 1 chooses 
Right with Heads and Tails. But the probability that Player 1 
chooses Right with Heads depends on what Player 1 could 
alternatively receive by choosing Left instead. So it is not 
possible to determine what Player 2’s optimal strategy is in 
the Right subtree without knowledge of the Left subtree. 
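
The dependence can be checked numerically. The sketch below is illustrative only: the function name `p2_best_guess` and the example probabilities are assumptions made for this illustration, not values from the paper. It computes Player 2's best guess after observing Right, given the probabilities with which Player 1 chooses Right in each state.

```python
from fractions import Fraction

def p2_best_guess(right_given_heads, right_given_tails):
    """Return Player 2's best guess after seeing Right, plus the belief in Heads.

    The coin is fair, so Bayes' rule weights each state by the probability
    that Player 1 chooses Right in it. At least one input must be nonzero.
    """
    reach_heads = Fraction(1, 2) * right_given_heads
    reach_tails = Fraction(1, 2) * right_given_tails
    belief_heads = reach_heads / (reach_heads + reach_tails)
    if belief_heads > Fraction(1, 2):
        guess = "Heads"
    elif belief_heads < Fraction(1, 2):
        guess = "Tails"
    else:
        guess = "indifferent"
    return guess, belief_heads

# Two hypothetical Player 1 strategies; which one is rational depends on
# what the Left subtree pays, which the Right subtree alone cannot tell us.
print(p2_best_guess(Fraction(3, 4), Fraction(1, 4)))
print(p2_best_guess(Fraction(1, 4), Fraction(3, 4)))
```

The two calls give opposite guesses, so Player 2's strategy in the Right subtree cannot be fixed without knowing how attractive Left is to Player 1.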


Figure 1: The example game of Coin Toss. “C” represents a 
chance node. S is a Player 2 ($P_2$) information set. The dotted 
line between the two $P_2$ nodes means $P_2$ cannot distinguish 
between the two states. 


Thus imperfect-information games cannot be solved via 
decomposition as perfect-information games can. Instead, 
the entire game is typically solved as a whole. This is a problem 
for large games, such as No-Limit Texas Hold’em (a common 
benchmark problem in imperfect-information game solving), 
which has $10^{165}$ nodes (Johanson 2013). 
The standard approach to computing strategies in such large 
games is to first generate an abstraction of the game, which 
is a smaller version of the game that retains as much as possible 
the strategic characteristics of the original game (Sandholm 
2010). This abstract game is solved (exactly or approximately) 
and its solution is mapped back to the original 
game. In extremely large games, a small abstraction typi- 
cally cannot capture all the strategic complexity of the game, 
and therefore results in a solution that is not a Nash equi- 
librium when mapped back to the original game. For this 
reason, it seems natural to attempt to improve the strategy 
when a sequence farther down the game tree is reached and 
the remaining subtree of reachable states is small enough to 
be represented without any abstraction (or in a finer abstrac- 
tion), even though—as explained previously—this may not 
lead to a Nash equilibrium. While it may not be possible 
to arrive at an equilibrium by analyzing subtrees indepen- 
dently, it may be possible to improve the strategies in those 
subtrees when the original (base) strategy is suboptimal, as 
is typically the case when abstraction is applied. 

We first review prior forms of endgame solving for 
imperfect-information games. Then we propose a new form 
of endgame solving that retains the theoretical guarantees of 
the best prior methods while performing better in practice. 
Finally, we introduce a method for endgame solving to be 
nested as players descend the game tree, leading to substan- 
tially better performance. 


Notation and Background for 
Imperfect-Information Games 


In an imperfect-information extensive-form game there is a 
finite set of players, $P$. $H$ is the set of all possible histories 
(nodes) in the game tree, each represented as a sequence of 
actions, and includes the empty history. $A(h)$ is the set of actions 
available in a history $h$, and $P(h) \in P \cup \{c\}$ is the player who 
acts at that history, where $c$ denotes chance. Chance plays 
an action $a \in A(h)$ with a fixed probability $\sigma_c(h, a)$ that is 
known to all players. The history $h'$ reached after an action $a$ 
is taken in $h$ is a child of $h$, represented by $h \cdot a = h'$, while $h$ 
is the parent of $h'$. If there exists a sequence of actions from 
$h$ to $h'$, then $h$ is an ancestor of $h'$ (and $h'$ is a descendant 
of $h$). $Z \subseteq H$ are terminal histories for which no actions are 
available. For each player $i \in P$, there is a payoff function 
$u_i : Z \to \mathbb{R}$. If $P = \{1, 2\}$ and $u_1 = -u_2$, the game is 
two-player zero-sum. 

Imperfect information is represented by information sets 
(infosets) for each player $i \in P$ by a partition $\mathcal{I}_i$ of 
$\{h \in H : P(h) = i\}$. For any infoset $I \in \mathcal{I}_i$, all histories 
$h, h' \in I$ are indistinguishable to player $i$, so $A(h) = A(h')$. 
$I(h)$ is the infoset $I$ where $h \in I$. $P(I)$ is the player $i$ such 
that $I \in \mathcal{I}_i$. $A(I)$ is the set of actions such that for all 
$h \in I$, $A(I) = A(h)$. $|A_i| = \max_{I \in \mathcal{I}_i} |A(I)|$ and 
$|A| = \max_i |A_i|$. 

A strategy $\sigma_i(I)$ is a probability vector over $A(I)$ for 
player $i$ in infoset $I$. The probability of a particular action 
$a$ is denoted by $\sigma_i(I, a)$. Since all histories in an infoset 
belonging to player $i$ are indistinguishable, the strategies 
in each of them must be identical. That is, for all $h \in I$, 
$\sigma_i(h) = \sigma_i(I)$ and $\sigma_i(h, a) = \sigma_i(I, a)$. A full-game 
strategy $\sigma_i \in \Sigma_i$ defines a strategy for each infoset belonging 
to player $i$. A strategy profile $\sigma$ is a tuple of strategies, one for 
each player. $u_i(\sigma_i, \sigma_{-i})$ is the expected payoff for player $i$ 
if all players play according to the strategy profile 
$\langle \sigma_i, \sigma_{-i} \rangle$. 

$\pi^\sigma(h) = \prod_{h' \cdot a \sqsubseteq h} \sigma_{P(h')}(h', a)$ is the joint probability of 
reaching $h$ if all players play according to $\sigma$. $\pi_i^\sigma(h)$ is the 
contribution of player $i$ to this probability (that is, the probability 
of reaching $h$ if all players other than $i$, and chance, 
always chose actions leading to $h$). $\pi_{-i}^\sigma(h)$ is the contribution 
of all players other than $i$, and chance. $\pi^\sigma(h, h')$ is the 
probability of reaching $h'$ given that $h$ has been reached, and 
0 if $h \not\sqsubset h'$. In a perfect-recall game, $\forall h, h' \in I \in \mathcal{I}_i$, 
$\pi_i(h) = \pi_i(h')$. In this paper we focus specifically on 
two-player zero-sum perfect-recall games. Therefore, for 
$i = P(I)$ we define $\pi_i(I) = \pi_i(h)$ for $h \in I$. Moreover, 
$I' \sqsubset I$ if for some $h' \in I'$ and some $h \in I$, $h' \sqsubset h$. Similarly, 
$I' \cdot a \sqsubset I$ if $h' \cdot a \sqsubset h$. We also define $\pi^\sigma(I, I')$ as the 
probability of reaching $I'$ from $I$ according to the strategy 
$\sigma$. 
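
As a small worked instance of these reach probabilities, the sketch below splits the probability of reaching one Coin Toss history into per-player contributions. The function name `reach_probabilities`, the dictionary layout, and the value 3/4 for Player 1's probability of choosing Right are assumptions made for this illustration.

```python
from fractions import Fraction

CHANCE = "c"

def reach_probabilities(history, acting_player, strategy):
    """Split the probability of reaching `history` into per-player contributions.

    history:       tuple of actions, e.g. ("Heads", "Right")
    acting_player: maps each proper prefix of `history` to the player acting there
    strategy:      maps (prefix, action) to the probability of that action
    Returns a dict {player: contribution}; the joint reach is their product.
    """
    contributions = {}
    for depth, action in enumerate(history):
        prefix = history[:depth]
        player = acting_player[prefix]
        prob = strategy[(prefix, action)]
        contributions[player] = contributions.get(player, Fraction(1)) * prob
    return contributions

# Coin Toss prefix: chance flips a fair coin, then Player 1 picks Left or Right.
acting_player = {(): CHANCE, ("Heads",): 1}
strategy = {
    ((), "Heads"): Fraction(1, 2),          # fair coin
    (("Heads",), "Right"): Fraction(3, 4),  # assumed Player 1 probability
}

parts = reach_probabilities(("Heads", "Right"), acting_player, strategy)
joint = Fraction(1)
for value in parts.values():
    joint *= value
pi_1 = parts[1]            # Player 1's own contribution
pi_minus_1 = joint / pi_1  # contribution of everyone else, including chance
```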

For convenience, we define an endgame. If a history is in 
an endgame, then any other history with which it shares an 
infoset must also be in the endgame. Moreover, any descendant 
of the history must be in the endgame. Formally, an 
endgame is a set of histories $S \subseteq H$ such that for all $h \in S$, 
if $h \sqsubset h'$, then $h' \in S$, and for all $h \in S$, if $h' \in I(h)$ for 
some $I \in \mathcal{I}_{P(h)}$ then $h' \in S$. The head of an endgame, $S_r$, is 
the union of infosets that have actions leading directly into 
$S$, but are not in $S$. Formally, $S_r$ is a set of histories such 
that for all $h \in S_r$, $h \notin S$ and either $\exists a \in A(h)$ such that 
$h \cdot a \in S$, or $h \in I$ and for some history $h' \in I$, $h' \in S_r$. 
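
The two closure conditions and the head of an endgame can be checked mechanically on a small tree. The following is a minimal sketch, assuming histories are encoded as tuples of action labels and infosets as Python sets; the helper names and the Coin Toss encoding (with the Left subtree collapsed to a single node) are illustrative choices, not part of the paper.

```python
def is_ancestor(h, h2):
    """True if history h is a proper prefix (strict ancestor) of history h2."""
    return len(h) < len(h2) and h2[:len(h)] == h

def is_endgame(S, histories, infosets):
    """Check the two closure conditions from the endgame definition."""
    S = set(S)
    for h in S:
        # Every descendant of a history in S must also be in S.
        if any(is_ancestor(h, h2) and h2 not in S for h2 in histories):
            return False
        # Every history sharing an infoset with h must also be in S.
        if any(h in infoset and not infoset <= S for infoset in infosets):
            return False
    return True

def head(S, histories, infosets):
    """Histories outside S whose infoset has an action leading directly into S."""
    S = set(S)
    direct = {h for h in histories
              if h not in S and any(c[:-1] == h for c in S)}
    result = set(direct)
    for infoset in infosets:
        if infoset & direct:
            result |= infoset - S
    return result

# Coin Toss, with the Left subtree collapsed to one node per coin outcome.
coins, guesses = ("H", "T"), ("h", "t", "f")
histories = [()]
for coin in coins:
    histories += [(coin,), (coin, "L"), (coin, "R")]
    histories += [(coin, "R", g) for g in guesses]

# Player 1 sees the coin; Player 2 cannot tell ("H", "R") from ("T", "R").
infosets = [{("H",)}, {("T",)}, {("H", "R"), ("T", "R")}]

right_endgame = {h for h in histories if h[1:2] == ("R",)}
heads_only = {h for h in right_endgame if h[0] == "H"}
```

Here `right_endgame` satisfies the definition, while `heads_only` does not because it splits Player 2's infoset.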

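A Nash equilibrium (Nash 1950) is a strategy profile 
$\sigma^*$ such that $\forall i$, $u_i(\sigma_i^*, \sigma_{-i}^*) = \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i}^*)$. 
An $\epsilon$-Nash equilibrium is a strategy profile $\sigma^*$ such that 
$\forall i$, $u_i(\sigma_i^*, \sigma_{-i}^*) + \epsilon \geq \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i}^*)$. In two-player 
zero-sum games, every Nash equilibrium results in the same 
expected value for a player. A best response $BR_i(\sigma_{-i})$ 
is a strategy for player $i$ such that $u_i(BR_i(\sigma_{-i}), \sigma_{-i}) = 
\max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i})$. The exploitability $\exp(\sigma_{-i})$ of a 
strategy $\sigma_{-i}$ is defined as $u_i(BR_i(\sigma_{-i}), \sigma_{-i}) - u_i(\sigma^*)$, 
where $\sigma^*$ is a Nash equilibrium. 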
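
To make the exploitability definition concrete, here is a minimal sketch on a normal-form game. Rock-Paper-Scissors is used purely as a stand-in (it does not appear in the paper), and the function names are assumptions of this illustration. Because a pure best response always exists, it suffices to take the best row against the fixed opponent strategy.

```python
from fractions import Fraction

def best_response_value(payoff, opponent_strategy):
    """Best expected payoff for the row player against a fixed column strategy.

    payoff[r][c] is the row player's payoff when row r meets column c.
    """
    return max(
        sum(p * payoff[r][c] for c, p in enumerate(opponent_strategy))
        for r in range(len(payoff))
    )

def exploitability(payoff, opponent_strategy, game_value):
    """u_i(BR_i(sigma_-i), sigma_-i) - u_i(sigma*), as in the definition above."""
    return best_response_value(payoff, opponent_strategy) - game_value

# Rock-Paper-Scissors from the row player's view; its equilibrium value is 0.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

uniform = [Fraction(1, 3)] * 3
always_rock = [Fraction(1), Fraction(0), Fraction(0)]
rock_heavy = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

print(exploitability(RPS, uniform, 0))      # equilibrium strategy
print(exploitability(RPS, always_rock, 0))  # fully predictable strategy
print(exploitability(RPS, rock_heavy, 0))   # mildly unbalanced strategy
```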

A counterfactual best response (Moravcik et al. 2016) 
$CBR_i(\sigma_{-i})$ is similar to a best response, but additionally 
maximizes counterfactual value at every infoset. Specifically, 
a counterfactual best response is a strategy $\sigma_i$ that is a 
best response with the additional condition that if $\sigma_i(I, a) > 0$ 
then $v_i^\sigma(I, a) = \max_{a'} v_i^\sigma(I, a')$. 

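We further define counterfactual best response 
value $CBV^{\sigma_{-i}}(I)$ as the value player $i$ expects 
to achieve by playing according to $CBR_i(\sigma_{-i})$ 
when in infoset $I$. Formally, 

$CBV^{\sigma_{-i}}(I, a) = \sum_{h \in I} \Big( \pi_{-i}^{\sigma_{-i}}(h) \sum_{z \in Z} \big( \pi^{\langle CBR_i(\sigma_{-i}), \sigma_{-i} \rangle}(h \cdot a, z) \, u_i(z) \big) \Big)$ 

and $CBV^{\sigma_{-i}}(I) = \max_{a \in A(I)} CBV^{\sigma_{-i}}(I, a)$. 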


Prior Approaches to Endgame Solving in 
Imperfect-Information Games 


In this section we review prior techniques for endgame solv- 
ing in imperfect-information games. Our new algorithm then 
builds on some of the ideas and notation. 

Throughout this section, we will refer to the Coin Toss 
game shown in Figure 1. We will focus on the Right 
endgame. If $P_1$ chooses Left, the game continues to a much 
larger endgame, but its structure is not relevant here. 

We assume that a base strategy profile $\sigma$ has already been 
computed for this game in which $P_1$ chooses Right $\frac{3}{4}$ of the 
time with Heads and $\frac{1}{2}$ of the time with Tails, and $P_2$ chooses 
Heads $\frac{1}{2}$ of the time, Tails $\frac{1}{4}$ of the time, and Forfeit $\frac{1}{4}$ of the 
time after $P_1$ chooses Right. The details of the base strategy 
in the Left endgame are not relevant in this section, but we 
assume that if $P_1$ played optimally then she would receive 
an expected payoff of 0.5 for choosing Left if the coin is 
Heads, and $-0.5$ for choosing Left if the coin is Tails. We 
will attempt to improve $P_2$’s strategy in the endgame that 
follows $P_1$ choosing Right. We refer to this endgame as $S$. 


Unsafe Endgame Solving 


We first review the most intuitive form of endgame solving, 
which we refer to as unsafe endgame solving (Billings et 
al. 2003; Gilpin and Sandholm 2006; 2007; Ganzfried and 
Sandholm 2015). This form of endgame solving assumes 
that both players will play according to their base strategies 
outside of the endgame. In other words, all nodes outside 
the endgame are fixed and can be treated as chance nodes 
with probabilities determined by the base strategy. Thus, the 
different roots of the endgame are reached with probabili- 
ties determined from the base strategies using Bayes’ rule. A 
strategy is then computed for the endgame—independently 
from the rest of the game. Applying unsafe endgame solving 
to Coin Toss (after $P_1$ chooses Right) would mean solving 
the game shown in Figure 2. 


Figure 2: The game solved by Unsafe endgame solving to 
determine a $P_2$ strategy in the Right endgame of Coin Toss. 


Specifically, we define $R$ as the set of earliest-reachable 
histories in $S$. That is, $h \in R$ if $h \in S$ and $h' \notin S$ for 
any $h' \sqsubset h$. We then calculate $\pi^\sigma(h)$ for each $h \in R$. A 
new game is constructed consisting only of an initial chance 
node and $S$. The initial chance node reaches $h \in R$ with 
probability $\frac{\pi^\sigma(h)}{\sum_{h' \in R} \pi^\sigma(h')}$. This new game is solved and its 
strategy is then used whenever $S$ is encountered. 
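
The construction can be traced on Coin Toss with a short sketch. It assumes a base strategy in which $P_1$ chooses Right 3/4 of the time with Heads and 1/2 of the time with Tails, and it assumes that an incorrect guess pays $P_1$ +1 (the figure is not reproduced here). Forfeit is left out because the text notes that Heads and Tails dominate it. All names in the code are illustrative.

```python
from fractions import Fraction

def root_distribution(reach):
    """Normalize the reach probabilities of the endgame roots (Bayes' rule)."""
    total = sum(reach.values())
    return {h: p / total for h, p in reach.items()}

# Reach probability of each root of S: fair coin times P1's assumed
# probability of choosing Right in that state.
reach = {
    "Heads": Fraction(1, 2) * Fraction(3, 4),
    "Tails": Fraction(1, 2) * Fraction(1, 2),
}
roots = root_distribution(reach)

# Assumed P1 payoffs, indexed by (coin, P2 guess); P2 receives the negation.
P1_PAYOFF = {("Heads", "Heads"): -1, ("Heads", "Tails"): 1,
             ("Tails", "Heads"): 1, ("Tails", "Tails"): -1}

def p2_value(guess):
    """P2's expected payoff for a guess against the fixed root distribution."""
    return -sum(roots[coin] * P1_PAYOFF[(coin, guess)] for coin in roots)

# With the roots fixed, P2 simply best-responds to the resulting belief.
unsafe_guess = max(["Heads", "Tails"], key=p2_value)
```

Because the root distribution is treated as fixed, the solution puts all weight on one guess, which is exactly what makes the resulting strategy exploitable.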

Unsafe endgame solving lacks theoretical solution quality 
guarantees and there are many situations where it performs 
extremely poorly. Indeed, if it were applied to the base strategy 
of Coin Toss, it would produce a strategy in which $P_2$ 
always chooses Heads, which $P_1$ could exploit severely by 
only choosing Right with Tails. Despite the lack of theoretical 
guarantees and potentially bad performance, unsafe 
endgame solving is simple and can sometimes produce low- 
exploitability strategies in large games, as we show later. 


We now move to discussing safe endgame solving tech- 
niques, that is, ones that ensure that the exploitability of the 
strategy is no higher than that of the base strategy. 


Re-Solve Refinement 


In Re-solve refinement (Burch, Johanson, and Bowling 
2014), a safe strategy is computed for $P_2$ in the endgame 
by constructing an auxiliary game, as shown in Figure 3, 
and computing an equilibrium strategy $\sigma^S$ for it. The auxiliary 
game consists of a starting chance node that connects 
to each history $h$ in $S_r$ in proportion to the probability that 
player $P_1$ could reach $h$ if $P_1$ tried to do so (that is, in proportion 
to $\pi_{-1}^\sigma(h)$). Let $a_S$ be the action available in $h$ such 
that $h \cdot a_S \in S$. At this point, $P_1$ has two possible actions. 
Action $a'_S$, the auxiliary-game equivalent of $a_S$, leads into 
$S$, while action $a'_T$ leads to a terminal payoff that awards 
the counterfactual best response value from the base strategy, 
$CBV^{\sigma_{-1}}(I(h), a_S)$. In the base strategy of Coin Toss, 
the counterfactual best response value of $P_1$ choosing Right 
is 0 if the coin is Heads and $\frac{1}{2}$ if the coin is Tails. Therefore, 
$a'_T$ leads to a terminal payoff of 0 for Heads and $\frac{1}{2}$ for Tails. 
After the equilibrium strategy $\sigma^S$ is computed in the auxiliary 
game, $\sigma_2^S$ is copied back to $S$ in the original game (that 
is, $P_2$ plays according to $\sigma_2^S$ rather than $\sigma_2$ when in $S$). In 
this way, the strategy for $P_2$ in $S$ is pressured to be similar 
to that in the original strategy; if $P_2$ were to choose a strategy 
that did better than the base strategy against Heads but 
worse against Tails, then $P_1$ would simply choose $a'_T$ with 
Heads and $a'_S$ with Tails. 


Figure 3: The auxiliary game used by Re-solve refinement to 
determine a $P_2$ strategy in the Right endgame of Coin Toss. 
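
The alternative payoffs in the Coin Toss auxiliary game can be reproduced with a few lines of arithmetic. The sketch below assumes a base $P_2$ strategy of Heads 1/2, Tails 1/4, Forfeit 1/4, and assumes that both an incorrect guess and Forfeit pay $P_1$ +1 (the figure is not reproduced here, so these payoffs are an assumption chosen to match the values quoted in the text). Function and variable names are illustrative.

```python
from fractions import Fraction

# Assumed P1 payoffs in the Right endgame, indexed by (coin, P2 action).
P1_PAYOFF = {
    ("Heads", "Heads"): -1, ("Heads", "Tails"): 1, ("Heads", "Forfeit"): 1,
    ("Tails", "Heads"): 1, ("Tails", "Tails"): -1, ("Tails", "Forfeit"): 1,
}

def value_of_right(coin, p2_strategy):
    """P1's expected payoff for choosing Right when the coin shows `coin`."""
    return sum(prob * P1_PAYOFF[(coin, action)]
               for action, prob in p2_strategy.items())

base_p2 = {"Heads": Fraction(1, 2), "Tails": Fraction(1, 4),
           "Forfeit": Fraction(1, 4)}

# Terminal payoffs offered to P1 by the alternative action in each state.
alt_payoff = {coin: value_of_right(coin, base_p2) for coin in ("Heads", "Tails")}

def p1_enters(coin, candidate_p2):
    """P1 enters S only if doing so beats the alternative payoff."""
    return value_of_right(coin, candidate_p2) > alt_payoff[coin]

# A candidate that does better against Heads but worse against Tails.
always_heads = {"Heads": Fraction(1), "Tails": Fraction(0), "Forfeit": Fraction(0)}
```

Against `always_heads`, $P_1$ takes the alternative payoff with Heads and enters $S$ with Tails, so such a candidate cannot be an equilibrium of the auxiliary game.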


Re-solve refinement is safe and useful for compactly stor- 
ing strategies and reconstructing them later. However, it may 
miss out on opportunities for improvement. For example, if 
we apply Re-solve refinement to our base strategy in Coin 
Toss, we may arrive at the same strategy as the base strategy, 
in which Player 2 chooses Forfeit 25% of the time, 
even though Heads and Tails dominate that action. The next 
endgame solving technique addresses this shortcoming. 


Maxmargin Refinement 


Maxmargin refinement (Moravcik et al. 2016) is similar to 
Re-solve refinement, except that it seeks to improve the 
endgame strategy as much as possible over the alternative 
payoff. While Re-solve refinement seeks a strategy for $P_2$ 
in $S$ that would simply dissuade $P_1$ from entering $S$, Maxmargin 
refinement additionally seeks to punish $P_1$ as much 
as possible if $P_1$ nevertheless chooses to enter $S$. A subgame 
margin is defined for each infoset in $S_r$, which represents 
the difference in value between entering the subgame 
versus choosing the alternative payoff. Specifically, for each 
infoset $I \in S_r$ and action $a_S$ leading to $S$, the subgame margin 
is $M(I, a_S) = v^{\sigma^S}(I, a'_T) - v^{\sigma^S}(I, a'_S)$, or equivalently 
$M(I, a_S) = CBV^{\sigma_{-1}}(I, a_S) - v^{\sigma^S}(I, a'_S)$. In Maxmargin 
refinement, a Nash equilibrium strategy is computed such 
that the minimum margin over all $I \in S_r$ is maximized. 

Given our base strategy in Coin Toss, Maxmargin refinement 
would result in $P_2$ choosing Heads with probability $\frac{5}{8}$, 
Tails with probability $\frac{3}{8}$, and Forfeit with probability 0. 
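
This outcome can be sanity-checked with a brute-force search, which stands in here for a proper equilibrium solver. The sketch assumes alternative payoffs of 0 (Heads) and 1/2 (Tails), and assumes that an incorrect guess and Forfeit both pay $P_1$ +1; it then scans a grid of $P_2$ strategies for the one maximizing the minimum margin. The grid resolution and all names are illustrative.

```python
from fractions import Fraction

# Assumed P1 payoffs in the Right endgame, indexed by (coin, P2 action).
P1_PAYOFF = {
    ("Heads", "Heads"): -1, ("Heads", "Tails"): 1, ("Heads", "Forfeit"): 1,
    ("Tails", "Heads"): 1, ("Tails", "Tails"): -1, ("Tails", "Forfeit"): 1,
}
# Assumed alternative payoffs (base-strategy values of choosing Right).
ALT = {"Heads": Fraction(0), "Tails": Fraction(1, 2)}

def margins(p2):
    """Margin per P1 infoset: alternative payoff minus value of entering S."""
    return {coin: ALT[coin] - sum(p2[a] * P1_PAYOFF[(coin, a)] for a in p2)
            for coin in ALT}

def maxmargin_grid(steps=16):
    """Scan P2 strategies on a grid and keep the one with the largest minimum margin."""
    best_margin, best_strategy = None, None
    for h in range(steps + 1):
        for t in range(steps + 1 - h):
            p2 = {"Heads": Fraction(h, steps),
                  "Tails": Fraction(t, steps),
                  "Forfeit": Fraction(steps - h - t, steps)}
            worst = min(margins(p2).values())
            if best_margin is None or worst > best_margin:
                best_margin, best_strategy = worst, p2
    return best_strategy, best_margin

strategy, margin = maxmargin_grid()
```

On this grid the search returns Heads 5/8, Tails 3/8, Forfeit 0, with both margins equal to 1/4.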

Maxmargin refinement is safe. Furthermore, it guarantees 
that if every Player 1 best response reaches the endgame 
with positive probability through some infoset(s) that have 
positive margin, then exploitability is strictly lower than that 
of the base strategy. 

Still, none of the prior techniques consider that in Coin 
Toss $P_1$ can achieve a payoff of 0.5 by choosing Left with 
Heads, and thus has more incentive to reach S when in the 
Tails state. The next section introduces our new technique, 
Reach-Maxmargin refinement, which solves this problem. 


Reach-Maxmargin Refinement 


In this section we introduce Reach-Maxmargin refinement, a 
new method for refining endgames that considers what pay- 
offs are achievable from other paths in the game. We first 
consider the case of refining a single endgame in a game tree. 
We then cover independently refining multiple endgames. 


Refining a Single Endgame 


All of the endgame-solving techniques described in the pre- 
vious section only consider the target endgame in isolation. 
This can be improved by incorporating information about 
what payoffs the players could receive by not reaching the 
endgame. For example in Coin Toss (Figure 1), $P_1$ can receive 
payoff 0.5 by choosing Left in the Heads state, and 
$-0.5$ in the Tails state. The solution that Maxmargin refinement 
produces would result in $P_1$ receiving payoff $-\frac{1}{4}$ by 
choosing Right in the Heads state, and $\frac{1}{4}$ in the Tails state. 
Thus, $P_1$ could simply always choose Left in the Heads state 
and Right in the Tails state against $P_2$’s strategy and receive 
expected payoff $\frac{3}{8}$. Reach-Maxmargin improves upon this. 
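
The arithmetic behind this exploit is easy to verify. The sketch below assumes the Maxmargin strategy for $P_2$ is Heads 5/8, Tails 3/8, Forfeit 0, and that an incorrect guess pays $P_1$ +1; the Left payoffs of 0.5 and −0.5 are the ones stated in the text. Names are illustrative.

```python
from fractions import Fraction

# P1's payoff for Left in each state, as stated in the text.
LEFT_PAYOFF = {"Heads": Fraction(1, 2), "Tails": Fraction(-1, 2)}

# Assumed Maxmargin strategy for P2 (Forfeit has probability 0 and is omitted).
P2_STRATEGY = {"Heads": Fraction(5, 8), "Tails": Fraction(3, 8)}

def right_payoff(coin):
    """P1's expected payoff for Right: -1 if P2 guesses the coin, +1 otherwise."""
    return sum(prob * (-1 if guess == coin else 1)
               for guess, prob in P2_STRATEGY.items())

def p1_best_response():
    """Pick the better of Left and Right in each state; the coin is fair."""
    plan, total = {}, Fraction(0)
    for coin in ("Heads", "Tails"):
        options = {"Left": LEFT_PAYOFF[coin], "Right": right_payoff(coin)}
        plan[coin] = max(options, key=options.get)
        total += Fraction(1, 2) * options[plan[coin]]
    return plan, total

plan, value = p1_best_response()
```

The best response is Left with Heads and Right with Tails, worth 3/8 in expectation, which is the weakness that taking the Left payoffs into account is meant to address.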

The auxiliary game used in Reach-Maxmargin refinement 
requires additional definitions. Define the path $Q_S(I)$ to an 
infoset $I \in S_r$ to be the set of infosets $I'$ such that $I' \sqsubseteq I$ 
and $I'$ is not an ancestor of any other information set in $S_r$. 
We also define $CBR_1(\sigma'_{-1})^{\to I \cdot a_S}$ as the $P_1$ strategy that 
plays to reach $I \cdot a_S$ in all infosets $I' \sqsubseteq I$, and elsewhere 
plays identically to $CBR_1(\sigma'_{-1})$. 

We now describe the auxiliary game used in Reach-Maxmargin. 
The auxiliary game begins with a chance node 
that leads to $h' \in I'$ in proportion to $\pi_{-1}^\sigma(h')$, where $I'$ is 
the earliest infoset such that $I' \in Q_S(I)$ for some $I \in S_r$. 
$P_1$ then has a choice between actions $a'_T$ and $a'_S$. Action $a'_T$ 
in Reach-Maxmargin refinement leads to a terminal payoff 
of $CBV^{\sigma_{-1}}(I')$. $P_1$ can instead take action $a'_S$, which can 
be viewed as $P_1$ attempting to reach $I \cdot a_S$ from $I'$. Since 
there may be $P_2$ nodes and chance nodes between $I'$ and 
$I$, $P_1$ may not reach $I$ from $I'$ with probability 1. If $P_1$ 
reaches an infoset $I'' \notin Q_S(I)$ that is “off the path” from 
$I$, then we assume $P_1$ plays according to a counterfactual 
best response from that point forward and receives a payoff 
of $CBV^{\sigma_{-1}}(I'')$. However, with probability $\pi_{-1}^\sigma(h', h)$, $P_1$ 
can reach history $h \cdot a'_S$ for $h \in I$. From this point on, the 
auxiliary game is identical to that in Re-solve and Maxmargin 
refinement. 

Formally, let $\sigma'$ be the strategy that plays according to 
$\sigma^S$ in $S$ and otherwise plays according to $\sigma$. For an infoset 
$I \in S_r$ and action $a_S$ leading to $S$, let $I'$ be the earliest 
infoset such that $I' \sqsubseteq I$ and $I'$ cannot reach an infoset in $S_r$ 
other than $I$. We define a reach margin as 


$M_r(I, \sigma', a_S) = CBV^{\sigma_{-1}}(I') - CBV^{\sigma'_{-1} \to I \cdot a_S}(I')$ 


Reach-Maxmargin refinement finds a Nash equilibrium 
$\sigma^S$ in the auxiliary game such that the minimum margin 
$\min_I M_r(I, \sigma', a_S)$ is maximized. Theorem 1 shows that 
Reach-Maxmargin refinement results in a combined strategy 
with exploitability lower than or equal to the base strategy. If 
the opponent reaches a refined endgame with positive prob- 
ability and the margin of the reached infoset is positive, then 
exploitability is strictly lower than that of the base strategy. 
This theorem statement is similar to that of Maxmargin re- 
finement (Moravcik et al. 2016), but the margins here are 
higher than (or equal to) those in Maxmargin refinement. 


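Theorem 1. Given a strategy $\sigma_2$, an endgame $S$ for $P_2$, 
and a refined endgame Nash equilibrium strategy $\sigma_2^S$, let 
$\sigma_2'$ be the strategy that plays according to $\sigma_2^S$ in endgame 
$S$ and $\sigma_2$ elsewhere. If $\min_I M_r(I, \sigma', a_S) \geq 0$ for $S$, then 
$\exp(\sigma_2') \leq \exp(\sigma_2)$. Furthermore, if $\pi^{\langle BR(\sigma_2'), \sigma_2' \rangle}(I) > 0$ 
for some $I \in S_r$ for an endgame $S$, then $\exp(\sigma_2') \leq 
\exp(\sigma_2) - \pi_{-1}^{\sigma_2'}(I) \min_I M_r(I, \sigma', a_S)$. 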

The auxiliary game can be solved in a way that maximizes 
the minimum margin by using a standard LP solver. In order 
to use iterative algorithms such as the Excessive Gap Tech- 
nique (Nesterov 2005; Gilpin, Peña, and Sandholm 2012) or 
Counterfactual Regret Minimization (CFR) (Zinkevich et al. 
2007), one can use the gadget game described by Moravcik 
et al. (2016). Details on the gadget game are provided in the 
Appendix. In our experiments we used CFR. 


Refining Multiple Endgames Independently 


Other endgame solving methods have also considered the 
cost of reaching an endgame (Waugh, Bard, and Bowl- 
ing 2009; Jackson 2014). However, those approaches (and 
the version of Reach-Maxmargin refinement we described 
above) are only correct in theory when applied to a 
single endgame. Typically, we want to refine multiple 
endgames independently—or, equivalently, any endgame 
that is reached at run time. This poses a problem because 
the construction of the auxiliary game assumes that all $P_2$ 
nodes outside the endgame have strategies that are fixed ac- 
cording to the base strategy. If this assumption is violated by 
refining multiple endgames, then the theoretical guarantees 
of Reach-Maxmargin refinement no longer hold. 


To address this issue, we first add a constraint that 
$CBV^{\sigma'_{-1}}(I) \leq CBV^{\sigma_{-1}}(I)$ for every $P_1$ infoset. This trivially 
guarantees that $\exp(\sigma_2') \leq \exp(\sigma_2)$. We also modify 
the Reach-Maxmargin auxiliary game. Let $\sigma'$ be the strategy 
profile after all endgames are solved and recombined. Ideally, 
when solving an endgame $S$ we would like any $P_1$ action 
leading away from $S$ (that is, any action $a$ belonging to 
an infoset $I' \in Q_S(I)$ such that $I' \cdot a \notin Q_S(I) \cup S$) to lead to 
a terminal payoff of $CBV^{\sigma'_{-1}}(h \cdot a)$ rather than $CBV^{\sigma_{-1}}(h \cdot a)$. 
However, since we are solving the endgames independently, 
we do not know what $\sigma'$ will be. Nevertheless, we can have 
$h \cdot a$ lead to a lower bound on $CBV^{\sigma'_{-1}}(h \cdot a)$. In our experiments 
we use the minimum reachable payoff as a lower 
bound.¹ Tighter upper and lower bounds, or accurate estimates 
of $CBV^{\sigma'_{-1}}(I)$ for an infoset $I$, may lead to even better 
empirical performance. 

Theorem 2 shows that even though the endgames are 
solved independently, if an endgame has positive minimum 
margin and is reached with positive probability then the final 
strategy will have lower exploitability than without Reach- 
Maxmargin endgame solving on that endgame. 


Theorem 2. Given a strategy $\sigma_2$, a set of disjoint endgames 
$\mathcal{S}$ for $P_2$, and a refined endgame Nash equilibrium strategy 
$\sigma_2^S$ for each endgame $S \in \mathcal{S}$, let $\sigma_2'$ be the strategy 
that plays according to $\sigma_2^S$ in each endgame $S$, respectively, 
and $\sigma_2$ elsewhere. Moreover, let $\sigma_2'^{-S}$ be the 
strategy that plays according to $\sigma_2'$ everywhere except for 
$P_2$ nodes in $S$, where it instead plays according to $\sigma_2$. If 
$\pi^{\langle BR(\sigma_2'), \sigma_2' \rangle}(I) > 0$ for some $I \in S_r$, then $\exp(\sigma_2') \leq 
\exp(\sigma_2'^{-S}) - \pi_{-1}^{\sigma_2'}(I) \min_I M_r(I, \sigma', a_S)$. 


We now introduce an improvement to Reach-Maxmargin 
refinement. Let $I'$ be an infoset in $Q_S(I)$. Let $a_O$ be an action 
leading away from $S$ and let $a_S$ be an action leading 
toward $S$. If the lower bound for $CBV^{\sigma^S}(I', a_O)$ is higher 
than $CBV^{\sigma^S}(I', a_S)$ then $S$ will never be reached through 
$I'$ in a Nash equilibrium. Thus, there is no point in further 
ther increasing the margin of J. This allows other margins 
to be larger instead, leading to better overall performance. 
This applies even when refining multiple endgames indepen- 
dently. We use this improvement in our experiments. 


Nested Endgame Solving 


As we have discussed, large games must be abstracted to 
reduce the game to a tractable size. This is particularly 
common in games with large or continuous action spaces. 
Typically the action space is discretized by action abstrac- 
tion so only a few actions are included in the abstraction. 
While we might limit ourselves to the actions we included 
in the abstraction, an opponent might choose actions that 
are not in the abstraction. In that case, the off-tree action 
can be mapped to an action that is in the abstraction, and 
the strategy from that in-abstraction action can be used. This 
is certainly problematic if the two actions are very different, 
but in many cases it leads to reasonable performance. 
For example, in an auction game we might include a bid 
of $100 in our abstraction. If a player bids $101, we can 
probably treat that as a bid of $100 without major problems. 
This is referred to as action translation (Gilpin, Sandholm, 
and Sørensen 2008; Schnizlein, Bowling, and Szafron 2009; 
Ganzfried and Sandholm 2013). Action translation is the 
state-of-the-art prior approach to dealing with this issue. It is 
used, for example, by all the leading competitors in the Annual 
Computer Poker Competition (ACPC). The leading action 
translation mapping (i.e., the way of mapping an opponent’s 
off-tree actions back to actions in the abstraction) is the 
pseudoharmonic mapping (Ganzfried and Sandholm 2013); 
it has an axiomatic foundation, plays intuitively correctly in 
small sanity-check games, and is used by most of the leading 
teams in the ACPC. That is the action mapping that we 
will benchmark against in our experiments. 


¹While this may seem like a loose lower bound, there are many 
situations where the off-path action simply leads to a terminal node. 
For these cases, the lower bound we use is optimal. 
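
For readers unfamiliar with action translation, the sketch below shows one common form of the pseudo-harmonic mapping. The formula is recalled from Ganzfried and Sandholm (2013) and is not stated in this paper, so treat it as an assumption: an off-tree bet $x$ (as a fraction of the pot) lying between in-abstraction bets $A < B$ is mapped to $A$ with probability $\frac{(B - x)(1 + A)}{(B - A)(1 + x)}$ and to $B$ otherwise. The function name is illustrative.

```python
from fractions import Fraction

def pseudo_harmonic_prob_lower(x, lower, upper):
    """Probability of translating bet x to the smaller in-abstraction bet.

    x, lower, upper are bet sizes as fractions of the pot with
    lower <= x <= upper and lower < upper. Formula assumed from
    Ganzfried and Sandholm (2013).
    """
    if not (lower <= x <= upper) or lower >= upper:
        raise ValueError("need lower <= x <= upper and lower < upper")
    return ((upper - x) * (1 + lower)) / ((upper - lower) * (1 + x))

# Example: a 0.75x pot bet observed by a player whose abstraction only
# contains the 0.5x and 1.0x pot bets.
p_half_pot = pseudo_harmonic_prob_lower(Fraction(3, 4), Fraction(1, 2), Fraction(1))
p_full_pot = 1 - p_half_pot
```

The mapping is randomized: bets exactly at an in-abstraction size are mapped to that size with probability 1, and bets in between are split between the two neighbours.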

In this section, we develop techniques for applying 
endgame solving to calculate responses to opponent’s off- 
tree actions, thereby obviating the need for action transla- 
tion. We present two methods that dramatically outperform 
the leading action translation technique. The same tech- 
niques can also be used more generally to calculate finer- 
grained card or action abstractions as play progresses down 
the game tree. In this section, for exposition, we assume that 
$P_2$ wishes to respond to $P_1$ choosing an off-tree action. 

The first method, which we refer to as the inexpensive 
method, begins by calculating a Nash equilibrium $\sigma$ within 
the abstraction, and calculating $CBV^{\sigma_{-1}}(I, a)$ for each infoset 
$I \in \mathcal{I}_1$ and action $a$ in the abstraction. When $P_1$ 
chooses an off-tree action $a$ in infoset $I$, an endgame $S$ 
is generated such that $I \in S_r$ and $I \cdot a$ leads to $S$. This 
endgame may be an abstraction. $S$ is solved using any of the 
safe endgame solving techniques discussed earlier, except 
that we use $CBV^{\sigma_{-1}}(I)$ in place of $CBV^{\sigma_{-1}}(I, a)$ (since 
$a$ is not a valid action in $I$ according to $\sigma$). The solution $\sigma^S$ 
is combined with $\sigma$ to form $\sigma'$. $CBV^{\sigma'_{-1}}(I', a)$ is then 
calculated for each infoset $I' \in S$ and each $I' \in Q_S(I)$ (that 
is, on the path to $I$). The process repeats whenever $P_1$ again 
chooses an off-tree action in $S$. 

By using $CBV^{\sigma_{-1}}(I)$ in place of $CBV^{\sigma_{-1}}(I, a)$, we 
can retain some of the theoretical guarantees of Reach-Maxmargin 
refinement and Maxmargin refinement. Intuitively, 
if in every information set $I$, $P_1$ is better off taking 
an action already in the game than the new action that 
was added, then the refined strategy is still a Nash equilibrium. 
Specifically, if the minimum reach margin $M_{\min}$ of 
the added action is nonnegative, then the combined strategy 
$\sigma'$ is a Nash equilibrium in the expanded game that contains 
the new action. If $M_{\min}$ is negative, then the distance of $\sigma'$ 
from a Nash equilibrium is proportional to $-M_{\min}$. 

This “inexpensive” approach does not apply with Unsafe 
endgame solving because the probability of reaching an action 
outside of a player’s abstraction is undefined. That is, 
$\pi^\sigma(h \cdot a)$ is undefined when $a$ is not considered a valid action 
in $h$ according to the abstraction. Nevertheless, a similar 
but more expensive approach is possible with Unsafe 
endgame solving (as well as all the other endgame-solving 
techniques) by starting the endgame solving at $h$ rather than 
at $h \cdot a$. In other words, if action $a$ taken in history $h$ is not in 
the abstraction, then Unsafe endgame solving is conducted 
in the smallest endgame containing $h$ (and action $a$ is added 
to that abstraction). This increases the size of the endgame 
compared to the inexpensive method because a strategy must 
be recomputed for every action $a' \in A(h)$ in addition to $a$. 
For example, if an off-tree action is chosen by the opponent 
as the first action in the game, then the strategy for the entire 
game must be recomputed. We therefore refer to this method 
as the expensive method. We present experiments with both 
methods. 


Experiments 


We conducted our experiments on a poker game we call No- 
Limit Flop Hold’em (NLFH). NLFH is similar to the popu- 
lar poker game of No-Limit Texas Hold’em except that there 
are only two rounds, called the preflop and flop. At the beginning 
of the game, each player receives two private cards 
from a 52-card deck. Player 1 puts in the “big blind” of 100 
chips, and Player 2 puts in the “small blind” of 50 chips. 
A round of betting then proceeds starting with Player 2, re- 
ferred to as the preflop, in which an unlimited number of bets 
or raises are allowed so long as a player does not put more 
than 20,000 chips (i.e., her entire chip stack) in the pot. Ei- 
ther player may fold on their turn, in which case the game 
immediately ends and the other player wins the pot. After the 
first betting round is completed, three community cards are 
dealt out, and another round of betting is conducted (start- 
ing with Player 1), referred to as the flop. At the end of this 
round, both players form the best possible five-card poker 
hand using their two private cards and the three community 
cards. The player with the better hand wins the pot. 

For equilibrium finding, we used a version of CFR called 
CFR+ (Tammelin et al. 2015) with the speed-improvement 
techniques introduced by Johanson et al. (2011). There is no 
randomness in our experiments. 

Our first experiment compares the performance of un- 
safe, re-solve, maxmargin, and reach-maxmargin refinement 
when applied to information abstraction (which is card ab- 
straction in the case of poker). Specifically, we solve NLFH 
with no information abstraction on the preflop. On the flop, 
there are 1,286,792 infosets for each betting sequence; the 
abstraction buckets them into 30,000 abstract ones (using 
a leading information abstraction algorithm (Ganzfried and 
Sandholm 2014)). We then apply endgame solving imme- 
diately after the preflop ends but before the flop commu- 
nity cards are dealt. We experiment with two versions of the 
game, one small and one large, which include only a few of 
the available actions in each infoset. The small game has 9 
non-terminal betting sequences on the preflop and 48 on the 
flop. The large game has 30 on the preflop and 172 on the 
flop. Table 1 shows the performance of each technique. In all 
our experiments, exploitability is measured in the standard 
units used in this field: milli big blinds per hand (mbb/h). 
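As a small worked example of the units and sizes quoted above, the following sketch converts mbb/h into chips per hand using the 100-chip big blind from the game description, and computes how many flop infosets fall into each abstract bucket on average. The function and constant names are illustrative assumptions, not part of the original experiments.

```python
# Illustrative helpers only; names are assumptions made for this sketch.
BIG_BLIND_CHIPS = 100      # big blind size stated in the game description
FLOP_INFOSETS = 1_286_792  # flop infosets per betting sequence (from the text)
FLOP_BUCKETS = 30_000      # abstract flop infosets (from the text)


def mbb_per_hand_to_chips(mbb_per_hand, big_blind=BIG_BLIND_CHIPS):
    """Convert milli big blinds per hand into chips per hand."""
    return mbb_per_hand * big_blind / 1000.0


def chips_to_mbb_per_hand(chips_per_hand, big_blind=BIG_BLIND_CHIPS):
    """Convert chips per hand into milli big blinds per hand."""
    return chips_per_hand * 1000.0 / big_blind


# Average number of real flop infosets merged into one abstract bucket.
infosets_per_bucket = FLOP_INFOSETS / FLOP_BUCKETS

if __name__ == "__main__":
    print(mbb_per_hand_to_chips(1000.0))  # one full big blind per hand, in chips
    print(round(infosets_per_bucket, 1))
```

With a 100-chip big blind, 1 mbb/h corresponds to a tenth of a chip per hand, which is why exploitabilities of a few mbb/h represent very small losses per hand.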

Despite lacking theoretical guarantees, Unsafe endgame 
solving outperformed the safe methods in the small game. 
However, it did substantially worse in the large game. This 
exemplifies its variability. Among the safe methods, our 
Reach-Maxmargin technique performed best on both games. 


                    Small Game    Large Game 
Base Strategy       9.128         4.141 
Unsafe              0.5514        39.68 
Resolve             8.120         3.626 
Maxmargin           0.9362        0.6121 
Reach-Maxmargin     0.8262        0.5496 


Table 1: Exploitability in mbb/h (evaluated in the game with no 
information abstraction) of the endgame-solving techniques. 
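To make the comparison in Table 1 concrete, the sketch below recomputes, from the reported numbers only, how much each technique reduces exploitability relative to the base strategy, and which safe method is best in each game. The dictionary layout and helper names are assumptions made for illustration.

```python
# Exploitability in mbb/h, copied from Table 1: (small game, large game).
TABLE_1 = {
    "Base Strategy": (9.128, 4.141),
    "Unsafe": (0.5514, 39.68),
    "Resolve": (8.120, 3.626),
    "Maxmargin": (0.9362, 0.6121),
    "Reach-Maxmargin": (0.8262, 0.5496),
}
SAFE_METHODS = ("Resolve", "Maxmargin", "Reach-Maxmargin")
SMALL, LARGE = 0, 1


def improvement_factor(method, game):
    """Base exploitability divided by the method's; values below 1 mean worse."""
    return TABLE_1["Base Strategy"][game] / TABLE_1[method][game]


def best_method(methods, game):
    """Method with the lowest exploitability in the given game."""
    return min(methods, key=lambda name: TABLE_1[name][game])


if __name__ == "__main__":
    for name in TABLE_1:
        small = improvement_factor(name, SMALL)
        large = improvement_factor(name, LARGE)
        print(f"{name:16s} small x{small:.2f}  large x{large:.2f}")
    print(best_method(SAFE_METHODS, SMALL), best_method(SAFE_METHODS, LARGE))
```

Unsafe endgame solving is the only technique whose factor drops below 1 in one of the two games, which is the variability noted above.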

The second experiment evaluates nested endgame solving 
using the different endgame solving techniques, and com- 
pares them to action translation. In order to also evaluate 
action translation, in this experiment, we create an NLFH 
game that includes 3 bet sizes at every point in the game 
tree (0.5, 0.75, and 1.0 times the size of the pot); a player 
can also decide not to bet. Only one bet (i.e., no raises) is 
allowed on the preflop, and three bets are allowed on the 
flop. There is no information abstraction anywhere in the 
game.* We also created a second, smaller abstraction of 
the game in which there is still no information abstraction, 
but the 0.75x pot bet is never available. We calculate the 
exploitability of one player using the smaller abstraction, 
while the other player uses the larger abstraction. When- 
ever the large-abstraction player chooses a 0.75x pot bet, the 
small-abstraction player generates and solves an endgame 
for the remainder of the game (which again does not in- 
clude any 0.75x pot bets) using the nested endgame solving 
techniques described above. This endgame strategy is then 
used as long as the large-abstraction player plays within the 
small abstraction, but if she chooses the 0.75x pot bet later 
again, then the endgame solving is used again, and so on. 
Table 2 shows that all the endgame solving techniques sub- 
stantially outperform action translation. Resolve, Maxmar- 
gin, and Reach-Maxmargin use inexpensive nested endgame 
solving, while Unsafe and “Reach-Maxmargin (expensive)” 
use the expensive approach. Reach-Maxmargin refinement 
performed the best, outperforming maxmargin refinement 
and unsafe endgame solving. These results suggest that 
nested endgame solving is preferable to action translation 
(if there is sufficient time to solve the endgame). 
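The two ways of handling an off-tree 0.75x pot bet can be sketched as follows. The first function is the randomized pseudo-harmonic mapping of Ganzfried and Sandholm as we recall it, which maps a bet x lying between abstraction sizes A and B to A with probability (B - x)(1 + A) / ((B - A)(1 + x)). The second shows only the control flow of nested endgame solving; `solve_endgame` is a hypothetical callable standing in for any of the refinement techniques, and clamping bets outside the abstraction's range is a simplification made for this sketch.

```python
import random

# Bet sizes (as fractions of the pot) in the smaller abstraction described above.
SMALL_ABSTRACTION_BETS = (0.5, 1.0)


def pseudo_harmonic_prob_lower(x, lower, upper):
    """Probability of mapping bet x to `lower`, assuming lower <= x <= upper."""
    return ((upper - x) * (1 + lower)) / ((upper - lower) * (1 + x))


def translate_bet(x, bets=SMALL_ABSTRACTION_BETS, rng=random):
    """Randomized action translation of an observed bet to an in-abstraction size."""
    sizes = sorted(bets)
    # Simplification: bets outside the covered range are clamped to the nearest size.
    if x <= sizes[0]:
        return sizes[0]
    if x >= sizes[-1]:
        return sizes[-1]
    if x in sizes:
        return x
    lower = max(b for b in sizes if b < x)
    upper = min(b for b in sizes if b > x)
    p_lower = pseudo_harmonic_prob_lower(x, lower, upper)
    return lower if rng.random() < p_lower else upper


def strategy_after_bet(observed_bet, current_strategy, solve_endgame,
                       bets=SMALL_ABSTRACTION_BETS):
    """Nested endgame solving: keep the current strategy for in-abstraction
    bets, and solve a fresh endgame whenever the bet is off-tree."""
    if observed_bet in bets:
        return current_strategy
    return solve_endgame(observed_bet)


if __name__ == "__main__":
    print(pseudo_harmonic_prob_lower(0.75, 0.5, 1.0))
    solved = []

    def toy_solver(bet):
        # Placeholder for an actual endgame solver.
        solved.append(bet)
        return f"endgame strategy #{len(solved)}"

    strategy = "precomputed strategy"
    for bet in (0.5, 0.75, 1.0, 0.75):
        strategy = strategy_after_bet(bet, strategy, toy_solver)
    print(strategy, len(solved))
```

Action translation treats the 0.75x pot bet as if it were one of the two neighboring sizes, whereas the nested approach builds a new endgame around the bet that was actually made.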


                                      Exploitability (mbb/h) 
Randomized Pseudo-Harmonic Mapping    146.5 
Resolve                               15.02 
Reach-Maxmargin (Expensive)           14.92 
Unsafe (Expensive)                    14.83 
Maxmargin                             12.20 
Reach-Maxmargin                       11.91 


Table 2: Comparison of the various endgame solving techniques 
in nested endgame solving. The performance of the 
pseudo-harmonic action translation is also shown. 
Exploitability is evaluated in the large action abstraction, and 
there is no information abstraction in this experiment. 


* There are no chip stacks in this version of NLFH. Chip stacks 
pose a considerable challenge to action translation, because the 
optimal strategy in a poker game can change drastically when any 
player has bet almost all her chips. Since action translation maps 
each bet size to a bet size in the abstraction, it may significantly 
overestimate or underestimate the number of chips in the pot, and 
therefore perform extremely poorly when near the chip stack limit. 
Refinement techniques do not suffer from the same problem. 
Conducting the experiments without chip stacks is thus conservative 
in that it favors action translation over the endgame solving 
techniques. We nevertheless show that the latter yield significantly 
better strategies. 


Conclusion 


We introduced an endgame solving technique for 
imperfect-information games that has stronger theoretical guarantees 
and better practical performance than prior endgame-solving 
methods. We presented results on exploitability of both safe 
and unsafe endgame solving techniques. We also introduced 
a method for nested endgame solving in response to the 
opponent’s off-tree actions, and demonstrated that this leads to 
dramatically better performance than the usual approach of 
action translation. This is, to our knowledge, the first time 
that exploitability of endgame solving techniques has been 
measured in large games. 

Appendix: Supplementary Material 
Description of Gadget Game 


Solving the auxiliary game described in Maxmargin Refine- 
ment and Reach-Maxmargin Refinement will not, by itself, 
maximize the minimum margin. While LP solvers can easily 
handle this objective, the process is more difficult for itera- 
tive algorithms such as Counterfactual Regret Minimization 
(CFR) and the Excessive Gap Technique (EGT). For these 
iterative algorithms, the auxiliary game can be modified into 
a gadget game that, when solved, will provide a Nash equi- 
librium to the auxiliary game and will also maximize the 
minimum margin (Moravcik et al. 2016). 

The gadget game differs from the auxiliary game in two 
ways. First, all $P_1$ payoffs that are reached from the 
initial information set of $I'$ are shifted by 
$CBV^{\sigma_{-1}}(I', a)$ in Maxmargin refinement and by 
$CBV^{\sigma_{-1}}(I')$ in Reach-Maxmargin refinement. Second, 
rather than the game starting with a chance node that 
determines $P_1$'s starting state, $P_1$ will get to decide for 
herself which state to begin the game in. Specifically, the game 
begins with a $P_1$ node where each action in the node 
corresponds to an information set $I$ in $S_r$ for Maxmargin 
refinement, or the earliest infoset $I' \in Q_S(I)$ for 
Reach-Maxmargin refinement. After $P_1$ chooses to enter an 
information set $I$, chance chooses the precise history 
$h \in I$ in proportion to $\pi^{\sigma}_{-1}(h)$. 

By shifting all payoffs by $CBV^{\sigma_{-1}}(I', a)$ or 
$CBV^{\sigma_{-1}}(I')$, the gadget game forces $P_2$ to focus on 
improving the performance of each information set over 
some baseline, which is the goal of Maxmargin and 
Reach-Maxmargin refinement. Moreover, by allowing $P_1$ to choose 
the state in which to enter the game, the gadget game forces 
$P_2$ to focus on maximizing the minimum margin. 
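The effect of the two modifications can be illustrated with a toy calculation. In the sketch below, all numbers, infoset labels, and candidate strategies are made up for illustration, and the gadget game is "solved" by enumerating two candidates rather than by running CFR or EGT. The margin of an infoset is taken to be the baseline value minus the opponent's best-response value against the refined strategy; letting the opponent choose the entry infoset, with payoffs shifted by the baseline, turns the refining player's objective into the minimum margin.

```python
# All values below are made-up illustrations, not results from the paper.
# Baseline counterfactual best-response values for P1 at two root infosets.
BASELINE_CBV = {"I_a": 1.0, "I_b": 2.0}
# Chance weights that a chance-start game would put on each infoset.
CHANCE_WEIGHT = {"I_a": 0.9, "I_b": 0.1}
# P1's best-response value at each infoset against two candidate refinements.
CANDIDATES = {
    "refined_x": {"I_a": 0.2, "I_b": 2.3},
    "refined_y": {"I_a": 0.8, "I_b": 1.8},
}


def margins(candidate):
    """Baseline value minus refined value, per infoset."""
    values = CANDIDATES[candidate]
    return {i: BASELINE_CBV[i] - values[i] for i in BASELINE_CBV}


def gadget_value_for_p1(candidate):
    """P1 picks the entry infoset; payoffs are shifted by the baseline,
    so P1's value is the largest (refined - baseline) = -(minimum margin)."""
    return max(-m for m in margins(candidate).values())


def chance_start_value_for_p1(candidate):
    """For contrast: chance picks the infoset and payoffs are not shifted."""
    values = CANDIDATES[candidate]
    return sum(CHANCE_WEIGHT[i] * values[i] for i in BASELINE_CBV)


def best_by_gadget():
    """P2 minimizes P1's gadget value, i.e. maximizes the minimum margin."""
    return min(CANDIDATES, key=gadget_value_for_p1)


def best_by_chance_start():
    """P2 minimizes P1's chance-weighted value instead."""
    return min(CANDIDATES, key=chance_start_value_for_p1)


if __name__ == "__main__":
    for name in CANDIDATES:
        print(name, margins(name))
    print(best_by_gadget(), best_by_chance_start())
```

In this toy instance the chance-weighted objective prefers a candidate that leaves one infoset with a negative margin, while the gadget objective selects the candidate whose worst-case margin is positive.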

Figure 4 illustrates the gadget game for Maxmargin re- 
finement. 


Figure 4: An example of a gadget game in Maxmargin 
refinement. $P_1$ picks the initial information set she wishes 
to enter $S_r$ in. Chance then picks the particular history 
of the information set, and play then proceeds 
identically to the auxiliary game. All $P_1$ payoffs are shifted by 
$CBV^{\sigma_{-1}}(I', a)$. 


Proof of Theorem 1 

Proof. Assume $M_r(I, \sigma, \sigma'_S) \ge 0$ for every 
information set $I$ in $S_r$ for an endgame $S$ and let 
$\epsilon = \min_I M_r(I, \sigma, \sigma'_S)$. 

For an information set $I \in S_r$, let $I'$ be the earliest 
information set in $Q_S(I)$. Then $CBV^{\sigma_{-1}}(I') \ge 
CBV^{\sigma'_{-1}, I' \to S}(I') + \epsilon$. 

First suppose that $\pi^{BR(\sigma'_2), \sigma'_2}(I) = 0$. Then 
either $\pi^{BR(\sigma'_2), \sigma'_2}(I') = 0$ or 
$\pi^{BR(\sigma'_2), \sigma'_2}(I', I) = 0$. If it is the 
former case, then $CBV^{\sigma'_{-1}}(I')$ does not affect 
$exp(\sigma'_2)$. If it is the latter case, then since $I$ is the 
only information set in $S_r$ reachable from $I'$, in any best 
response $I'$ only reaches nodes outside of $S$ with positive 
probability. The nodes outside $S$ belonging to $P_2$ were 
unchanged between $\sigma$ and $\sigma'$, so 
$CBV^{\sigma'_{-1}}(I') \le CBV^{\sigma_{-1}}(I')$. 

Now suppose that $\pi^{BR(\sigma'_2), \sigma'_2}(I) > 0$. 
Since $BR(\sigma'_2)$ already reaches $I$ on its own, 
$CBV^{\sigma'_{-1}}(I') = CBV^{\sigma'_{-1}, I' \to S}(I')$. Since 
$CBV^{\sigma_{-1}}(I') \ge CBV^{\sigma'_{-1}, I' \to S}(I') + 
\epsilon$, we get $CBV^{\sigma_{-1}}(I') \ge 
CBV^{\sigma'_{-1}}(I') + \epsilon$. This is the condition 
for Theorem 1 in Moravcik et al. (2016). Thus, from that 
theorem, we get that $exp(\sigma'_2) \le exp(\sigma_2) - 
\epsilon \pi^{\sigma_2}_{-1}(I)$. 

Now consider any information set $I'' \sqsubset I'$. Before 
encountering any $P_2$ nodes whose strategies are different in 
$\sigma'$ (that is, $P_2$ nodes in $S$), $P_1$ must first traverse 
an $I'$ information set as previously defined. But for every $I'$ 
information set, $CBV^{\sigma'_{-1}}(I') \le 
CBV^{\sigma_{-1}}(I')$. Therefore, 
$CBV^{\sigma'_{-1}}(I'') \le CBV^{\sigma_{-1}}(I'')$. 


Proof of Theorem 2 

Proof. Let $S \in \mathcal{S}$ be an endgame for $P_2$ and 
assume $\pi^{BR(\sigma'_2), \sigma'_2}(I) > 0$ for some 
$I \in S_r$. Let $\epsilon = \min_I M_r(I, \sigma, \sigma'_S)$ 
and let $I'$ be the earliest information set in $Q_S(I)$. Since 
we added the constraint that $CBV^{\sigma'_{-1}}(I) \le 
CBV^{\sigma_{-1}}(I)$ for all $P_1$ information sets, 
$\epsilon \ge 0$. We only consider the non-trivial case where 
$\epsilon > 0$. Since $BR(\sigma'_2)$ already reaches $I'$ on its 
own, $CBV^{\sigma'_{-1}}(I') = 
CBV^{\sigma'_{-1}, I' \to S}(I')$. 

Let $\sigma'^{S}_2$ represent the strategy which plays according 
to $\sigma'_S$ in $P_2$ nodes of $S$ and elsewhere plays according 
to $\sigma_2$. Since $\epsilon > 0$ and we assumed the minimum 
payoff for every $P_1$ action in $Q_S(I)$ that does not lead to 
$I$, $CBV^{\sigma'^{S}_{-1}, I' \to S}(I') \le 
CBV^{\sigma_{-1}}(I') - \epsilon$. 

Moreover, since $\sigma'^{S}_2$ assumes a value of 
$CBV^{\sigma_{-1}}(h)$ is received whenever a history 
$h \notin Q_S(I)$ is reached due to chance or $P_2$, and 
$CBV^{\sigma_{-1}}(h)$ is an upper bound on 
$CBV^{\sigma'_{-1}}(h)$, $CBV^{\sigma'^{S}_{-1}, I' \to S}(I') 
\ge CBV^{\sigma'_{-1}, I' \to S}(I')$. 

Thus, $CBV^{\sigma'_{-1}, I' \to S}(I') \le 
CBV^{\sigma_{-1}}(I') - \epsilon$. Finally, since $I'$ can be 
reached with probability $\pi^{\sigma_2}_{-1}(I')$, 
$exp(\sigma'_2) \le exp(\sigma_2) - \pi^{\sigma_2}_{-1}(I) 
\min_I M_r(I, \sigma, \sigma'_S)$. 


